UNC Charlotte-Virus-Tracker

VAST 2011 Challenge

Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Tools:

Text processing and analysis were performed using the Natural Language Toolkit (NLTK, www.nltk.org). Custom Python (http://python.org) scripts were written to call NLTK functions; four scripts (49-170 lines each) were used in the text processing. All of the visualizations were implemented using SDL and SDL_image (www.libsdl.org) and OpenGL. All of the work was performed at UNC Charlotte. NLTK involves a fairly steep learning curve. Our group had prior experience with SDL and OpenGL. The visualization application took roughly a month of effort to put together.

Text Processing: The microblog file was processed using NLTK to produce a ranked, time-ordered blog. This involved (1) a keyword search on domain-relevant words such as 'flu' and 'sweats'; (2) custom Python scripts that produced 30 random concordances of the relevant words, which were visually inspected to build the grammar search scripts; (3) grammar extraction that ranked the blog entries into 7 categories (1-7, where 1 is the most relevant); and (4) a similar procedure that produced a ranked event file.
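The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the NLTK scripts used for the submission: the keyword list beyond 'flu' and 'sweats', the two ranking patterns, and every function name are hypothetical stand-ins for the actual grammar search scripts.

```python
import random
import re

# Hypothetical keyword list; the text above names only 'flu' and 'sweats'.
KEYWORDS = {"flu", "sweats", "fever", "nausea"}

# Hypothetical patterns standing in for the grammar search scripts.
# Rank 1 is the most relevant; keyword hits matching no pattern get rank 7.
RANK_PATTERNS = [
    (1, re.compile(r"\bi (have|got|caught) (the |a )?(flu|fever)\b")),
    (2, re.compile(r"\b(my|our) \w+ (has|have) (the |a )?(flu|fever)\b")),
]
LOWEST_RANK = 7


def tokenize(text):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_hits(posts):
    """Keep only the posts that contain at least one domain keyword."""
    return [p for p in posts if KEYWORDS & set(tokenize(p))]


def concordances(posts, word, width=3, sample=30, seed=0):
    """Return up to `sample` randomly chosen keyword-in-context snippets."""
    lines = []
    for post in posts:
        tokens = tokenize(post)
        for i, tok in enumerate(tokens):
            if tok == word:
                lines.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    random.Random(seed).shuffle(lines)
    return lines[:sample]


def rank(post):
    """Return the rank of the first matching pattern, else the lowest rank."""
    text = post.lower()
    for r, pattern in RANK_PATTERNS:
        if pattern.search(text):
            return r
    return LOWEST_RANK
```

In this sketch the concordance snippets are what a person would read to decide which phrasings indicate that the blogger is actually sick, and those phrasings would then be encoded as additional entries in RANK_PATTERNS.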

Visualization Tools: Our virus tracker application displays all output superimposed on the Vastopolis map, with the following major features: (1) all blogs are points on the map and can be 'played' over time (yellow dots in Fig. 1); (2) blogs can be filtered by rank; (3) blogs at an event are displayed separately; (4) areas with above-average numbers of sick people can be highlighted as rectangular regions, permitting trends to be clarified; (5) up to 3 blog filters on specific terms, with corresponding highlighting (used in identifying the waterborne outbreak).

Video:

[MOVIE(MPEG v4)]

MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

Fig. 1: Initial Outbreak (Downtown Area): May 18, 2.26am-8.26am

The outbreak began on May 18th at approximately 8AM in the Downtown region (Figs. 1, 2). Ground Zero is likely located within a short range of the Vastopolis Dome, and the affected area extends eastward throughout Downtown towards Interstate 278 and the east side. This conclusion was drawn by scanning a time-lapse plot of a subset of the micro-blog data related to illness. After seeing signs of the initial event (the three clusters in Figs. 1, 2), we performed a minute-by-minute viewing of the time frame in question using an automated player. The play-through was refined by filtering blog data by rank, a value encoded within the subset that represents the likelihood that the blogger is actually sick. This filtering eliminated noise and allowed clear clusters to emerge. The clusters were sampled by clicking points (blogs), which allowed the messages to be read and confirmed.
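The minute-by-minute play-through with rank filtering can be sketched as a sliding time window. This is only an illustration in Python and not the SDL/OpenGL player itself: the six-hour window is inferred from the figure captions, and the record fields (`time`, `rank`) and function names are assumptions.

```python
from datetime import datetime, timedelta


def visible_posts(posts, window_end, window=timedelta(hours=6), max_rank=7):
    """Return posts inside (window_end - window, window_end] with rank <= max_rank.

    Rank 1 is the most relevant, so a smaller max_rank means stricter filtering.
    """
    window_start = window_end - window
    return [
        p for p in posts
        if window_start < p["time"] <= window_end and p["rank"] <= max_rank
    ]


def play(posts, start, end, step=timedelta(minutes=1), **filter_args):
    """Yield (frame_time, visible posts) for each playback step from start to end."""
    t = start
    while t <= end:
        yield t, visible_posts(posts, t, **filter_args)
        t += step
```

Each yielded frame corresponds to one redraw of the map; lowering `max_rank` between passes is the filtering step that removes noise from the clusters.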

MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

Hypothesis: Our analysis detected two distinct outbreaks. The first, which we suspect to be airborne, occurred in the Downtown district on May 18th. The second is a waterborne contaminant which manifests itself on the 19th and makes its way mostly out of the area by the 20th. It is our belief that the airborne outbreak is largely contained by the 20th; however, it may progress westward with the wind, though to a lesser degree than the initial outbreak. The waterborne contaminant may still be making its way down the Vast River and is worth noting to emergency management personnel overseeing areas tied to the river.

Fig. 2: Initial Outbreak (Downtown Area): May 18, 6.30am-12.30pm. Notice the spread eastwards along Interstate 278.

Beginning with the suspected airborne virus, a high number of reported symptoms consistent with those of a common flu or cold come from the Downtown area on the morning of the 18th, as illustrated in Figures 1 and 2 (2.26am-8.26am and 6.30am-12.30pm intervals, respectively). This outbreak spreads locally in an extremely rapid fashion, particularly within densely populated areas or centers of activity. By superimposing another subset that we derived from the micro-blog data related to event activity, we can see that a large technology convention occurs slightly after the initial outbreak, which would help to propagate an airborne illness. This is shown in Fig. 3, where the blue dots are related to occupants of the convention center. Although the affected area also extends eastwards, this can likely be attributed to Interstate 278 being the most direct route to the Uptown/Downtown districts.

Fig. 3: Convention Center Overlay: May 18, 4.29am-10.29am. The blue dots indicate blog posts from the Convention Center.

By continuing to monitor the location of virus reports, we could see the virus appearing across the entire map (Fig. 4), at which point it was helpful to apply a custom ranking filter (as per our text processing described earlier), which isolated posts more likely to have been made by a sick individual. By that evening, the virus had reached most areas of the map. More detailed analysis revealed that this happens around 6PM, which is around the hour we expect those working in the city to be arriving home. Even though the virus had left the Downtown region, its spread in the outside areas was nowhere near as rapid as it was Downtown, which supports the hypothesis that the virus spreads more easily through close human contact.

Fig. 4: May 19, 11.57am-5.57pm. Data showing all sick people (no filtering).

On the 19th we were able to detect the virus trending largely in two directions. We further isolated this by overlaying an adaptive grid, as seen in Fig. 5. The grid automatically computes an average number of data points for each sector based on the initial data set, and highlights sectors which are currently above their average according to a set threshold. The resolution of the grid can be adjusted; for instance, a lower resolution grid will compute a more accurate average if the initial data set is not very dense. Using this grid, we clearly identified one trend along the interstates through the east side, and another along the banks of the Vast River, moving southwest and originating close to the Downtown area.
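One way to read the adaptive grid is as a per-sector comparison between a baseline average and the current count. The Python sketch below follows that reading under stated assumptions: the baseline is taken to be a list of earlier time windows, "above average according to a set threshold" is interpreted as exceeding the average multiplied by a threshold factor, and all names and parameters are illustrative rather than taken from the application.

```python
from collections import Counter


def cell_of(x, y, width, height, cols, rows):
    """Map a map coordinate to its (col, row) grid sector."""
    col = min(int(x / width * cols), cols - 1)
    row = min(int(y / height * rows), rows - 1)
    return col, row


def cell_counts(points, width, height, cols, rows):
    """Count the points that fall in each sector."""
    return Counter(cell_of(x, y, width, height, cols, rows) for x, y in points)


def baseline_averages(windows, width, height, cols, rows):
    """Average per-sector count over a non-empty list of baseline windows."""
    if not windows:
        raise ValueError("at least one baseline window is required")
    totals = Counter()
    for points in windows:
        totals.update(cell_counts(points, width, height, cols, rows))
    return {cell: n / len(windows) for cell, n in totals.items()}


def hot_cells(current, averages, width, height, cols, rows, threshold=1.5):
    """Return sectors whose current count exceeds threshold * their average.

    Sectors absent from the baseline have an average of 0, so any current
    activity there is highlighted.
    """
    counts = cell_counts(current, width, height, cols, rows)
    return {
        cell for cell, n in counts.items()
        if n > threshold * averages.get(cell, 0.0)
    }
```

Choosing smaller values of `cols` and `rows` pools more points into each sector, which is the trade-off described above for sparse initial data.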

Fig. 5: May 19, 11.57am-5.57pm. Overlaid grid to better illustrate trends in viral spread.

The trend along the interstates was expected; however, the trend along the river was a new development which needed further analysis. To investigate, we randomly sampled a few points along the river to identify common symptoms. This quickly yielded results related to 'stomach' pain, 'nausea', 'diarrhea', and the like, whereas other regions seemed to have symptoms closer to a 'fever' or 'flu'. In order to rapidly verify this visually, we developed a custom filter tool, which can search the raw data set on the fly and highlight matches in a given color. Doing this quickly gave us visual confirmation that there was indeed a separate illness likely making its way down the river. By highlighting flu symptoms in one color and common waterborne symptoms in another, we were able to see that each was distinct, with very little overlap. This is evident in Fig. 6, with the green dots corresponding to blogs matching "nausea" or "stomach", while the red dots match "fever".
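The term filters can be sketched as a mapping from search terms to highlight colors. The Python sketch below is a simplified illustration of the idea: the case-insensitive substring match, the first-match-wins rule, and the function names are assumptions, while the three terms and two colors are those used for Fig. 6.

```python
MAX_FILTERS = 3  # the tool supports up to 3 term filters


def highlight(posts, filters):
    """Map each matching post index to the color of the first filter it matches.

    `filters` is a list of (term, color) pairs; matching is a case-insensitive
    substring test, which is an assumption of this sketch.
    """
    if len(filters) > MAX_FILTERS:
        raise ValueError("at most %d filters are supported" % MAX_FILTERS)
    colors = {}
    for i, text in enumerate(posts):
        lowered = text.lower()
        for term, color in filters:
            if term in lowered:
                colors[i] = color
                break
    return colors


def overlap(posts, terms_a, terms_b):
    """Count posts that match at least one term from each of the two groups."""
    count = 0
    for text in posts:
        lowered = text.lower()
        if any(t in lowered for t in terms_a) and any(t in lowered for t in terms_b):
            count += 1
    return count


FIG6_FILTERS = [("nausea", "green"), ("stomach", "green"), ("fever", "red")]
```

A low value from `overlap` for the two symptom groups is the numeric counterpart of the visual observation that the green and red dots rarely coincide.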


Fig. 6: May 19, 10.22am-4.22pm. To highlight the waterborne contaminant and its spread, specialized filters searching for 'nausea', 'fever' and 'stomach' highlight and distinguish the two types of spread (green versus red dots, each dot being a blog). It can be clearly seen that the green dots are flowing along the river.

Further confirming the waterborne contaminant was the fact that its progression was consistent with the flow of the river, and it spread southward and out of Vastopolis in a visible and predictable fashion. We were able to track these ailments, which had mostly made their way out of the region by the morning of the 20th, with just a few lingering traces.

At this point the airborne virus also seems to be in recession, with many of the infected remaining in the hospitals. Samples taken from the hospitals seem to indicate that the virus has been confirmed as some type of flu.